Comparative evaluation of term selection functions for authorship attribution

Author

  • Jacques Savoy
Abstract

Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. From this perspective, the first step is to determine the most effective features (words, punctuation symbols, part-of-speech tags, word bigrams, etc.) to discriminate between different authors. To achieve this, we can consider different independent feature-scoring selection functions (information gain, gain ratio, pointwise mutual information, odds ratio, chi-square, bi-normal separation, GSS, Darmstadt Indexing Approach (DIA), and correlation coefficient). Other term selection strategies have also been suggested in specific authorship attribution studies. To compare these two families of selection procedures, we have extracted articles from two newspapers, belonging to two categories (sports and politics). To broaden the basis of our evaluations, we have chosen one newspaper written in English (‘Glasgow Herald’) and a second one in Italian (‘La Stampa’). The resulting collections contain from 987 to 2,036 articles written by four to ten columnists. Using the Kullback–Leibler divergence, the chi-square measure, and the Delta rule as attribution schemes, this study found that some simple selection strategies (based on occurrence frequency or document frequency) may produce similar, and sometimes better, results compared with more complex ones.
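To make the pipeline described in the abstract concrete, the following is a minimal sketch of one of the simple strategies it mentions: terms are selected by document frequency, and a query text is attributed to the author whose profile has the smallest Kullback–Leibler divergence from it. The function names, the additive smoothing constant, the whitespace tokenization, and the toy documents are all assumptions made for illustration; this is not the experimental setup used in the paper.

```python
import math
from collections import Counter


def select_by_document_frequency(docs, k):
    """Return the k terms occurring in the most documents (ties broken alphabetically)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def profile(tokens, terms, smoothing=0.1):
    """Smoothed relative frequencies restricted to the selected terms.

    The additive smoothing value is an arbitrary assumption; it keeps every
    probability strictly positive so the divergence below is always defined.
    """
    term_set = set(terms)
    counts = Counter(t for t in tokens if t in term_set)
    total = sum(counts.values()) + smoothing * len(terms)
    return {t: (counts[t] + smoothing) / total for t in terms}


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def attribute(query_tokens, author_docs, k=50):
    """Return (best author, all scores); the lowest divergence wins."""
    all_docs = [doc for docs in author_docs.values() for doc in docs]
    terms = select_by_document_frequency(all_docs, k)
    query_profile = profile(query_tokens, terms)
    scores = {}
    for author, docs in author_docs.items():
        merged = [t for doc in docs for t in doc]  # one profile per author
        scores[author] = kl_divergence(query_profile, profile(merged, terms))
    return min(scores, key=scores.get), scores


# Hypothetical toy corpus: two "columnists" with two short articles each.
author_docs = {
    "A": ["the match was won by the home team".split(),
          "the team played well in the match".split()],
    "B": ["parliament voted on a new budget law".split(),
          "a budget debate in parliament on a law".split()],
}
query = "the home team won the match".split()
best, scores = attribute(query, author_docs, k=10)
print(best, scores)
```

Replacing `select_by_document_frequency` with another scoring function (for example information gain or chi-square) while keeping `attribute` unchanged would be one way to compare selection strategies under a fixed attribution scheme.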

Similar resources

Etude comparative de stratégies de sélection de prédicteurs pour l'attribution d'auteur

The authorship attribution problem can be viewed as a categorization problem. To determine the most effective features to discriminate between different writers (or categories), we have evaluated seven feature selection functions (e.g., pointwise mutual information, information gain, odds ratio, χ², or correlation coefficient). We have also considered two selection functions proposed in the cont...

Application of PROMETHEE method for green supplier selection: a comparative result based on preference functions

The PROMETHEE is a significant method for evaluating alternatives with respect to criteria in multi-criteria decision-making problems. It is characterized by many types of preference functions that are used for assigning the differences between alternatives in judgements. This paper proposes a preference of green suppliers using the PROMETHEE under the usual criterion preference functions. Comp...

Authorship Attribution in Bengali Language

We describe Authorship Attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-to-end system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and learning curve to assess the relationship between amount of training data and test accu...

Towards a better understanding of Burrows's Delta in literary authorship attribution

Burrows’s Delta is the most established measure for stylometric difference in literary authorship attribution. Several improvements on the original Delta have been proposed. However, a recent empirical study showed that none of the proposed variants constitute a major improvement in terms of authorship attribution performance. With this paper, we try to improve our understanding of how and why ...

EPSMS and the Document Occurrence Representation for Authorship Identification - Notebook for PAN at CLEF 2011

This paper describes the participation of the PISIS team in the authorship identification track of PAN’11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mi...

Journal title:
  • DSH

Volume 30

Publication date: 2015